  • Is the share of female Nobel prize winners becoming more equal over time?
  • Which country do most Nobel prize winners come from?
  • How does your country rank with respect to the number of Nobel prize winners in history?
  • What is the trend in age with respect to Nobel prize winners (per prize category)?
  • Which generations produce the most Nobel prize winners?

Training: https://learn.datacamp.com/projects/nobel-winners
Inspiration: https://www.kaggle.com/kenjee
Documentation: https://seaborn.pydata.org/

Table of Contents¶

Notebook Setup¶

  • Packages
  • Source Data

Data Preprocessing¶

  • Add feature Decade
  • Add feature Age
  • Add feature Generation
  • Adjusting for duplicate records

Data Visualization¶

  • Nobel prizes by Sex
  • Nobel prizes by Country
  • Nobel prizes by Age
  • Nobel prizes by Generation

Notebook Setup¶

Packages¶

In [2]:
pip install geopandas
In [6]:
import os # operating system (files)
import numpy as np  # Linear algebra
import pandas as pd  # Data processing
import geopandas as gpd  # Geometry data for plotting data on (world) maps 
import seaborn as sns  # Data visualization
import matplotlib.pyplot as plt  # Data visualization
from matplotlib.ticker import PercentFormatter  # Format axis in percentages
from mpl_toolkits.axes_grid1 import make_axes_locatable  # Scale axis of (world) maps

Source Data¶

In [128]:
df_nobel = pd.read_csv('archive.csv')
display(df_nobel.tail(50))
Year Category Prize Motivation Prize Share Laureate ID Laureate Type Full Name Birth Date Birth City Birth Country Sex Organization Name Organization City Organization Country Death Date Death City Death Country
919 2013 Economics The Sveriges Riksbank Prize in Economic Scienc... "for their empirical analysis of asset prices" 1/3 895 Individual Lars Peter Hansen 1952-10-26 Urbana, IL United States of America Male University of Chicago Chicago, IL United States of America NaN NaN NaN
920 2013 Economics The Sveriges Riksbank Prize in Economic Scienc... "for their empirical analysis of asset prices" 1/3 896 Individual Robert J. Shiller 1946-03-29 Detroit, MI United States of America Male Yale University New Haven, CT United States of America NaN NaN NaN
921 2013 Literature The Nobel Prize in Literature 2013 "master of the contemporary short story" 1/1 892 Individual Alice Munro 1931-07-10 Wingham Canada Female NaN NaN NaN NaN NaN NaN
922 2013 Medicine The Nobel Prize in Physiology or Medicine 2013 "for their discoveries of machinery regulating... 1/3 884 Individual James E. Rothman 1950-11-03 Haverhill, MA United States of America Male Yale University New Haven, CT United States of America NaN NaN NaN
923 2013 Medicine The Nobel Prize in Physiology or Medicine 2013 "for their discoveries of machinery regulating... 1/3 885 Individual Randy W. Schekman 1948-12-30 St. Paul, MN United States of America Male University of California Berkeley, CA United States of America NaN NaN NaN
924 2013 Medicine The Nobel Prize in Physiology or Medicine 2013 "for their discoveries of machinery regulating... 1/3 885 Individual Randy W. Schekman 1948-12-30 St. Paul, MN United States of America Male Howard Hughes Medical Institute NaN NaN NaN NaN NaN
925 2013 Medicine The Nobel Prize in Physiology or Medicine 2013 "for their discoveries of machinery regulating... 1/3 886 Individual Thomas C. Südhof 1955-12-22 Göttingen Germany Male Stanford University Stanford, CA United States of America NaN NaN NaN
926 2013 Medicine The Nobel Prize in Physiology or Medicine 2013 "for their discoveries of machinery regulating... 1/3 886 Individual Thomas C. Südhof 1955-12-22 Göttingen Germany Male Howard Hughes Medical Institute NaN NaN NaN NaN NaN
927 2013 Peace The Nobel Peace Prize 2013 "for its extensive efforts to eliminate chemic... 1/1 893 Organization Organisation for the Prohibition of Chemical W... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
928 2013 Physics The Nobel Prize in Physics 2013 "for the theoretical discovery of a mechanism ... 1/2 887 Individual François Englert 1932-11-06 Etterbeek Belgium Male Université Libre de Bruxelles Brussels Belgium NaN NaN NaN
929 2013 Physics The Nobel Prize in Physics 2013 "for the theoretical discovery of a mechanism ... 1/2 888 Individual Peter W. Higgs 1929-05-29 Newcastle upon Tyne United Kingdom Male University of Edinburgh Edinburgh United Kingdom NaN NaN NaN
930 2014 Chemistry The Nobel Prize in Chemistry 2014 "for the development of super-resolved fluores... 1/3 909 Individual Eric Betzig 1960-01-13 Ann Arbor, MI United States of America Male Janelia Research Campus, Howard Hughes Medical... Ashburn, VA United States of America NaN NaN NaN
931 2014 Chemistry The Nobel Prize in Chemistry 2014 "for the development of super-resolved fluores... 1/3 910 Individual Stefan W. Hell 1962-12-23 Arad Romania Male Max Planck Institute for Biophysical Chemistry Göttingen Germany NaN NaN NaN
932 2014 Chemistry The Nobel Prize in Chemistry 2014 "for the development of super-resolved fluores... 1/3 910 Individual Stefan W. Hell 1962-12-23 Arad Romania Male German Cancer Research Center Heidelberg Germany NaN NaN NaN
933 2014 Chemistry The Nobel Prize in Chemistry 2014 "for the development of super-resolved fluores... 1/3 911 Individual William E. Moerner 1953-06-24 Pleasanton, CA United States of America Male Stanford University Stanford, CA United States of America NaN NaN NaN
934 2014 Economics The Sveriges Riksbank Prize in Economic Scienc... "for his analysis of market power and regulation" 1/1 915 Individual Jean Tirole 1953-08-09 Troyes France Male Toulouse School of Economics (TSE) Toulouse France NaN NaN NaN
935 2014 Literature The Nobel Prize in Literature 2014 "for the art of memory with which he has evoke... 1/1 912 Individual Patrick Modiano 1945-07-30 Paris France Male NaN NaN NaN NaN NaN NaN
936 2014 Medicine The Nobel Prize in Physiology or Medicine 2014 "for their discoveries of cells that constitut... 1/2 903 Individual John O'Keefe 1939-11-18 New York, NY United States of America Male University College London United Kingdom NaN NaN NaN
937 2014 Medicine The Nobel Prize in Physiology or Medicine 2014 "for their discoveries of cells that constitut... 1/4 904 Individual May-Britt Moser 1963-01-04 Fosnavåg Norway Female Norwegian University of Science and Technology... Trondheim Norway NaN NaN NaN
938 2014 Medicine The Nobel Prize in Physiology or Medicine 2014 "for their discoveries of cells that constitut... 1/4 905 Individual Edvard I. Moser 1962-04-27 Ålesund Norway Male Norwegian University of Science and Technology... Trondheim Norway NaN NaN NaN
939 2014 Peace The Nobel Peace Prize 2014 "for their struggle against the suppression of... 1/2 913 Individual Kailash Satyarthi 1954-01-11 Vidisha India Male NaN NaN NaN NaN NaN NaN
940 2014 Peace The Nobel Peace Prize 2014 "for their struggle against the suppression of... 1/2 914 Individual Malala Yousafzai 1997-07-12 Mingora Pakistan Female NaN NaN NaN NaN NaN NaN
941 2014 Physics The Nobel Prize in Physics 2014 "for the invention of efficient blue light-emi... 1/3 906 Individual Isamu Akasaki 1929-01-30 Chiran Japan Male Meijo University Nagoya Japan NaN NaN NaN
942 2014 Physics The Nobel Prize in Physics 2014 "for the invention of efficient blue light-emi... 1/3 906 Individual Isamu Akasaki 1929-01-30 Chiran Japan Male Nagoya University Nagoya Japan NaN NaN NaN
943 2014 Physics The Nobel Prize in Physics 2014 "for the invention of efficient blue light-emi... 1/3 907 Individual Hiroshi Amano 1960-09-11 Hamamatsu Japan Male Nagoya University Nagoya Japan NaN NaN NaN
944 2014 Physics The Nobel Prize in Physics 2014 "for the invention of efficient blue light-emi... 1/3 908 Individual Shuji Nakamura 1954-05-22 Ikata Japan Male University of California Santa Barbara, CA United States of America NaN NaN NaN
945 2015 Chemistry The Nobel Prize in Chemistry 2015 "for mechanistic studies of DNA repair" 1/3 921 Individual Tomas Lindahl 1938-01-28 Stockholm Sweden Male Francis Crick Institute Hertfordshire United Kingdom NaN NaN NaN
946 2015 Chemistry The Nobel Prize in Chemistry 2015 "for mechanistic studies of DNA repair" 1/3 921 Individual Tomas Lindahl 1938-01-28 Stockholm Sweden Male Clare Hall Laboratory Hertfordshire United Kingdom NaN NaN NaN
947 2015 Chemistry The Nobel Prize in Chemistry 2015 "for mechanistic studies of DNA repair" 1/3 922 Individual Paul Modrich 1946-06-13 Raton, NM United States of America Male Howard Hughes Medical Institute Durham, NC United States of America NaN NaN NaN
948 2015 Chemistry The Nobel Prize in Chemistry 2015 "for mechanistic studies of DNA repair" 1/3 922 Individual Paul Modrich 1946-06-13 Raton, NM United States of America Male Duke University School of Medicine Durham, NC United States of America NaN NaN NaN
949 2015 Chemistry The Nobel Prize in Chemistry 2015 "for mechanistic studies of DNA repair" 1/3 923 Individual Aziz Sancar 1946-09-08 Savur Turkey Male University of North Carolina Chapel Hill, NC United States of America NaN NaN NaN
950 2015 Economics The Sveriges Riksbank Prize in Economic Scienc... "for his analysis of consumption, poverty, and... 1/1 926 Individual Angus Deaton 1945-10-19 Edinburgh United Kingdom Male Princeton University Princeton, NJ United States of America NaN NaN NaN
951 2015 Literature The Nobel Prize in Literature 2015 "for her polyphonic writings, a monument to su... 1/1 924 Individual Svetlana Alexievich 1948-05-31 Ivano-Frankivsk Ukraine Female NaN NaN NaN NaN NaN NaN
952 2015 Medicine The Nobel Prize in Physiology or Medicine 2015 "for their discoveries concerning a novel ther... 1/4 916 Individual William C. Campbell 1930-06-28 Ramelton Ireland Male Drew University Madison, NJ United States of America NaN NaN NaN
953 2015 Medicine The Nobel Prize in Physiology or Medicine 2015 "for their discoveries concerning a novel ther... 1/4 917 Individual Satoshi Ōmura 1935-07-12 Yamanashi Prefecture Japan Male Kitasato University Tokyo Japan NaN NaN NaN
954 2015 Medicine The Nobel Prize in Physiology or Medicine 2015 "for her discoveries concerning a novel therap... 1/2 918 Individual Youyou Tu 1930-12-30 Zhejiang Ningbo China Female China Academy of Traditional Chinese Medicine Beijing China NaN NaN NaN
955 2015 Peace The Nobel Peace Prize 2015 "for its decisive contribution to the building... 1/1 925 Organization National Dialogue Quartet NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
956 2015 Physics The Nobel Prize in Physics 2015 "for the discovery of neutrino oscillations, w... 1/2 919 Individual Takaaki Kajita 1959-03-09 Higashimatsuyama Japan Male University of Tokyo Kashiwa Japan NaN NaN NaN
957 2015 Physics The Nobel Prize in Physics 2015 "for the discovery of neutrino oscillations, w... 1/2 920 Individual Arthur B. McDonald 1943-08-29 Sydney Canada Male Queen's University Kingston Canada NaN NaN NaN
958 2016 Chemistry The Nobel Prize in Chemistry 2016 "for the design and synthesis of molecular mac... 1/3 931 Individual Jean-Pierre Sauvage 1944-10-21 Paris France Male University of Strasbourg Strasbourg France NaN NaN NaN
959 2016 Chemistry The Nobel Prize in Chemistry 2016 "for the design and synthesis of molecular mac... 1/3 932 Individual Sir J. Fraser Stoddart 1942-05-24 Edinburgh United Kingdom Male Northwestern University Evanston, IL United States of America NaN NaN NaN
960 2016 Chemistry The Nobel Prize in Chemistry 2016 "for the design and synthesis of molecular mac... 1/3 933 Individual Bernard L. Feringa 1951-05-18 Barger-Compascuum Netherlands Male University of Groningen Groningen Netherlands NaN NaN NaN
961 2016 Economics The Sveriges Riksbank Prize in Economic Scienc... "for their contributions to contract theory" 1/2 935 Individual Oliver Hart 1948-10-09 London United Kingdom Male Harvard University Cambridge, MA United States of America NaN NaN NaN
962 2016 Economics The Sveriges Riksbank Prize in Economic Scienc... "for their contributions to contract theory" 1/2 936 Individual Bengt Holmström 1949-04-18 Helsinki Finland Male Massachusetts Institute of Technology (MIT) Cambridge, MA United States of America NaN NaN NaN
963 2016 Literature The Nobel Prize in Literature 2016 "for having created new poetic expressions wit... 1/1 937 Individual Bob Dylan 1941-05-24 Duluth, MN United States of America Male NaN NaN NaN NaN NaN NaN
964 2016 Medicine The Nobel Prize in Physiology or Medicine 2016 "for his discoveries of mechanisms for autophagy" 1/1 927 Individual Yoshinori Ohsumi 1945-02-09 Fukuoka Japan Male Tokyo Institute of Technology Tokyo Japan NaN NaN NaN
965 2016 Peace The Nobel Peace Prize 2016 "for his resolute efforts to bring the country... 1/1 934 Individual Juan Manuel Santos 1951-08-10 Bogotá Colombia Male NaN NaN NaN NaN NaN NaN
966 2016 Physics The Nobel Prize in Physics 2016 "for theoretical discoveries of topological ph... 1/2 928 Individual David J. Thouless 1934-09-21 Bearsden United Kingdom Male University of Washington Seattle, WA United States of America NaN NaN NaN
967 2016 Physics The Nobel Prize in Physics 2016 "for theoretical discoveries of topological ph... 1/4 929 Individual F. Duncan M. Haldane 1951-09-14 London United Kingdom Male Princeton University Princeton, NJ United States of America NaN NaN NaN
968 2016 Physics The Nobel Prize in Physics 2016 "for theoretical discoveries of topological ph... 1/4 930 Individual J. Michael Kosterlitz 1943-06-22 Aberdeen United Kingdom Male Brown University Providence, RI United States of America NaN NaN NaN
In [8]:
def null_count_by_column(df):
    """Lists number of missing values per column if n missing values > 0"""
    print(f'DataFrame shape: {df.shape}', end='\n\n')
    col_missing_values = (df.isnull().sum()).sort_values(ascending=False)
    print(f'DataFrame feature # missing values: \n{col_missing_values[col_missing_values > 0]}')


null_count_by_column(df_nobel)  # the function prints its report and returns None, so no outer print() is needed
DataFrame shape: (969, 18)

DataFrame feature # missing values: 
Death City              370
Death Country           364
Death Date              352
Organization Country    253
Organization City       253
Organization Name       247
Motivation               88
Birth Date               29
Birth City               28
Birth Country            26
Sex                      26
dtype: int64

Data Preprocessing¶

Add feature Decade¶

Adding a feature to the dataset indicating the respective 'Decade' per record, based on the 'Year' the Nobel prize was awarded.

In [9]:
df_nobel['Decade'] = (df_nobel['Year'] // 10) * 10  # floor each award year to the start of its decade
print(f'Unique values for added Decade in the dataset: {df_nobel.Decade.unique()}')
Unique values for added Decade in the dataset: [1900 1910 1920 1930 1940 1950 1960 1970 1980 1990 2000 2010]
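The decade assignment can be checked on a few hand-picked years. This is a minimal sketch on made-up values: the `years` series is hypothetical and not taken from `archive.csv`. It uses integer floor division, which gives the same result as flooring `Year / 10` and multiplying by 10.

```python
import pandas as pd

# Hypothetical award years chosen to sit on and around decade boundaries
years = pd.Series([1901, 1909, 1910, 1999, 2016])

# Integer floor division drops the last digit; multiplying by 10 restores the scale
decades = (years // 10) * 10

print(decades.tolist())
```

Years ending in 0 start a new decade, so 1910 is assigned to 1910 while 1909 still belongs to 1900.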

Add feature Age¶

Adding a feature to the dataset indicating the respective 'Age' per record, based on 'Birth Date' and the 'Year' the Nobel prize was awarded. In addition, each record is allocated to an 'Age_Group' based on the calculated 'Age'. Because only the birth year is used, 'Age' is the age the laureate reaches in the award year.

In [10]:
df_nobel['Birth Date'] = pd.to_datetime(df_nobel['Birth Date'], errors='coerce')
df_nobel['Age'] = df_nobel['Year'] - df_nobel['Birth Date'].dt.year
df_nobel['Age_Group'] = pd.cut(df_nobel['Age'], bins=[0, 18, 30, 64, 99],
                               labels=['Youth', 'Young Adult', 'Adult', 'Senior'])

print('Relative share of Nobel prize winners per added Age_Group:')
display(df_nobel['Age_Group'].value_counts(normalize=True, sort=False).to_frame())
Relative share of Nobel prize winners per added Age_Group:
Age_Group
Youth 0.001066
Young Adult 0.001066
Adult 0.655650
Senior 0.342217
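To see where the boundaries between age groups fall, the same bins can be applied to a handful of ages placed exactly on the edges. This is a small sketch under the assumption that the default `pd.cut` behaviour (right-closed intervals) is what the cell above relies on; the `ages` values are made up for illustration.

```python
import pandas as pd

# Hypothetical ages placed on and next to the bin edges used above
ages = pd.Series([17, 18, 19, 30, 31, 64, 65, 99])

# Same bins and labels as the Age_Group feature.
# With the default right=True the intervals are (0, 18], (18, 30], (30, 64], (64, 99]
groups = pd.cut(ages, bins=[0, 18, 30, 64, 99],
                labels=['Youth', 'Young Adult', 'Adult', 'Senior'])

print(pd.DataFrame({'Age': ages, 'Age_Group': groups}))
```

An age of exactly 18 still counts as 'Youth' and an age of exactly 64 still counts as 'Adult', because each bin includes its right edge.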

Add feature Generation¶

Adding a feature to the dataset indicating the respective 'Generation' per record based on 'Birth Date'.

In [11]:
# Source: https://en.wikipedia.org/wiki/Generation#/media/File:Generation_timeline.svg
generations = ['Ancient', 'Lost Generation', 'Greatest Generation', 'Silent Generation', 'Boomers I',
               'Boomers II', 'Generation X', 'Millennials (Y)', 'Generation Z', 'Generation Alpha']
# Bin edges are birth years. pd.cut intervals are right-closed, (left, right], so records born in the
# earliest year itself fall outside the first bin and get no Generation
age_bins = [min(df_nobel['Birth Date'].dt.year), 1883, 1900, 1927, 1945, 1955, 1965, 1980, 1996, 2012, 2021]
df_nobel['Generation'] = pd.cut(df_nobel['Birth Date'].dt.year, age_bins, labels=generations)
display(df_nobel['Generation'].value_counts(normalize=True, sort=False).to_frame())
Generation
Ancient 0.202775
Lost Generation 0.122732
Greatest Generation 0.323372
Silent Generation 0.243330
Boomers I 0.076841
Boomers II 0.023479
Generation X 0.006403
Millennials (Y) 0.000000
Generation Z 0.001067
Generation Alpha 0.000000
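One detail worth checking in the binning above: the earliest birth year is used as the left edge of the first bin, and `pd.cut` leaves the left edge out by default. The sketch below illustrates this on made-up birth years (the values and the shortened bin list are hypothetical); `include_lowest=True` is one possible way to keep the earliest year, not something the notebook itself does.

```python
import pandas as pd

# Hypothetical birth years; 1817 plays the role of the earliest year in the data
birth_years = pd.Series([1817, 1850, 1883, 1884])
bins = [birth_years.min(), 1883, 1900]
labels = ['Ancient', 'Lost Generation']

# Default: the first interval is (1817, 1883], so 1817 itself is not assigned
default_cut = pd.cut(birth_years, bins, labels=labels)

# include_lowest=True closes the first interval on the left as well
inclusive_cut = pd.cut(birth_years, bins, labels=labels, include_lowest=True)

print('Unassigned by default:', default_cut.isna().sum())
print('Unassigned with include_lowest=True:', inclusive_cut.isna().sum())
```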

Duplicate records¶

Individual Nobel prize winners are identified by 'Laureate ID'. Paul Modrich is listed twice for the same 'Category' and 'Year', each time with a different 'Organization Name'. These records indicate that he was affiliated with two organizations when he won, not that he won twice, so such repeated rows are dropped before counting Nobel prizes. Marie Curie, by contrast, was awarded the Nobel prize twice, in 1903 for Physics and in 1911 for Chemistry. Those records differ in 'Year' and 'Category' and are counted as separate awards.

In [120]:
df_nobel_prizes = df_nobel.drop_duplicates(subset=['Year', 'Category', 'Laureate ID'])
print(f'Number of (possibly shared) Nobel Prizes handed out between 1901 and 2016: {len(df_nobel_prizes)}')
Number of (possibly shared) Nobel Prizes handed out between 1901 and 2016: 911
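The effect of the `subset` argument can be illustrated on a toy frame. This is a sketch with made-up rows: the IDs and organization names are hypothetical and only mimic the two situations described above (one prize with two affiliations, and one laureate with two prizes).

```python
import pandas as pd

# Hypothetical rows: laureate 1 has one prize listed under two organizations,
# laureate 2 has two distinct prizes
toy = pd.DataFrame({
    'Year': [2015, 2015, 1903, 1911],
    'Category': ['Chemistry', 'Chemistry', 'Physics', 'Chemistry'],
    'Laureate ID': [1, 1, 2, 2],
    'Organization Name': ['Institute A', 'University B', None, 'University C'],
})

# Rows count as duplicates only if Year, Category and Laureate ID all match;
# the first matching row is kept
prizes = toy.drop_duplicates(subset=['Year', 'Category', 'Laureate ID'])

print(f'{len(toy)} rows before, {len(prizes)} rows after')
```

The second affiliation row of laureate 1 is dropped, while both prizes of laureate 2 survive because they differ in 'Year' and 'Category'.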

Saving CSV file for data visualization in Tableau

In [13]:
from pathlib import Path  
filepath = Path('BigDataProject/out.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
df_nobel.to_csv(filepath)  

Data Visualization¶

In [14]:
!pip install seaborn 
In [15]:
# seaborn.apionly is not available in the installed seaborn (0.11.2), so the regular import is used
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
In [20]:
data=df_nobel
In [21]:
ChemistryDF = data[(data.Category == 'Chemistry')]
EconomicsDF = data[(data.Category == 'Economics')]
LiteratureDF = data[(data.Category == 'Literature')]
MedicineDF = data[(data.Category == 'Medicine')]
PeaceDF = data[(data.Category == 'Peace')]
PhysicsDF = data[(data.Category == 'Physics')]

Chemistry¶

In [114]:
femaledata=data[(data.Sex == 'Female')]
In [217]:
plt.figure(figsize=(10,12))
FemaleGraph = sns.countplot(y="Birth Country", data=femaledata,
              order=femaledata['Birth Country'].value_counts().index,
              palette='GnBu_d')
plt.show()
In [ ]:
plt.figure(figsize=(10,12))
ChemistryGraph = sns.countplot(y="Birth Country", data=ChemistryDF,
              order=ChemistryDF['Birth Country'].value_counts().index,
              palette='GnBu_d')
plt.show()

Nobel prizes by Sex¶

In [112]:
# The majority of winners are male, but there is a slight increase in female laureates over time
fig, ax = plt.subplots(figsize=(10, 10))
sns.countplot(data=df_nobel_prizes, x='Decade',hue='Sex', ax=ax)
ax.set_title(f'Countplot of the number of Nobel Prizes won in history:')
ax.set_ylabel('Nobel prize count')
ax.legend(loc='upper right')
Out[112]:
<matplotlib.legend.Legend at 0x7fd5d92504c0>
In [108]:
# Same countplot restricted to female laureates
fig, ax = plt.subplots(figsize=(10, 10))
sns.countplot(data=data[(data.Sex=="Female")],hue='Sex', x='Decade', ax=ax)
ax.set_title(f'Countplot of the number of Nobel Prizes won in history:')
ax.set_ylabel('Nobel prize count')
Out[108]:
Text(0, 0.5, 'Nobel prize count')
In [77]:
# Female laureates per decade, split by prize category
fig, ax = plt.subplots(figsize=(10, 10))
sns.countplot(data=data[(data.Sex=="Female")],hue='Category', x='Decade', ax=ax)
ax.set_title(f'Countplot of the number of Nobel Prizes won in history:')
ax.set_ylabel('Nobel prize count')
Out[77]:
Text(0, 0.5, 'Nobel prize count')
In [24]:
# The regression plot indeed indicates the share of female winners has increased over the years
pivot = pd.crosstab(df_nobel_prizes['Sex'], df_nobel_prizes['Decade'], values='Laureate ID', aggfunc='count',
                    normalize='columns')
pivot = pivot.transpose()

fig, ax = plt.subplots(figsize=(10, 10))
sns.regplot(ax=ax, data=pivot, x=pivot.index, y=pivot['Male'], color="lightblue")
sns.regplot(ax=ax, data=pivot, x=pivot.index, y=pivot['Female'], color="pink")
ax.set_title(f'Regression plot of the share of Male/Female Nobel Prizes won in history:')
ax.set_ylabel(f'Share of Nobel Prizes')
ax.yaxis.set_major_formatter(PercentFormatter(1.0))
ax.legend(labels=['Male', 'Female'])

# In absolute numbers the increase in female winners seems modest; in relative numbers the regression line indicates that:
# (x is measured in years, so the slope a is the change in share per year and a * 10 is the change per decade)
a, b = np.polyfit(pivot.index, pivot['Female'], 1)
print(f'Over the full history the share of Female Nobel prize winners has increased by {a * 10 * 100:.2f} percentage points per decade.')
Over the full history the share of Female Nobel prize winners has increased by 0.43 percentage points per decade.
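Because the x-values passed to `np.polyfit` are calendar years (1900, 1910, ...), the fitted slope is a change in share per year, not per decade. The sketch below shows the unit conversion on made-up shares that rise by exactly one percentage point per decade; the arrays are hypothetical and not derived from the Nobel data.

```python
import numpy as np

# Hypothetical shares rising by exactly one percentage point per decade
decades = np.array([1980, 1990, 2000, 2010])
share = np.array([0.02, 0.03, 0.04, 0.05])

# Degree-1 fit: the first coefficient is the slope in share per unit of x (one year)
slope_per_year, intercept = np.polyfit(decades, share, 1)

# Ten years per decade, and a factor 100 to express a share as percentage points
pp_per_decade = slope_per_year * 10 * 100

print(f'{pp_per_decade:.2f} percentage points per decade')
```

Multiplying the slope by 100 instead of 10 would give the change per century.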
In [25]:
# Adding the regression lines for the last decades indicates the increase in share of female winners has become steeper
fig, ax = plt.subplots(figsize=(10, 10))
sns.regplot(ax=ax, data=pivot, x=pivot.index, y=pivot['Male'], color="lightblue")
sns.regplot(ax=ax, data=pivot, x=pivot.index, y=pivot['Female'], color="pink")
sns.regplot(ax=ax, data=pivot, x=pivot.index[-4:], y=pivot['Male'][-4:], color="blue")
sns.regplot(ax=ax, data=pivot, x=pivot.index[-4:], y=pivot['Female'][-4:], color="red")
ax.set_title(f'Regression plot of the share of Male/Female Nobel Prizes won over the last decades:')
ax.set_ylabel(f'Share of Nobel Prizes')
ax.yaxis.set_major_formatter(PercentFormatter(1.0))
ax.legend(labels=['Male', 'Female'])

# In absolute numbers the increase in female winners seems modest; in relative numbers the regression line indicates that:
# (x is measured in years, so the slope a is the change in share per year and a * 10 is the change per decade)
a, b = np.polyfit(pivot.index[-4:], pivot['Female'][-4:], 1)
print(f'Over the last 4 decades the share of Female Nobel prize winners has increased by {a * 10 * 100:.2f} percentage points per decade.')
Over the last 4 decades the share of Female Nobel prize winners has increased by 1.99 percentage points per decade.
In [26]:
# Nobel prizes won by male and female laureates per category
sns.set_style("whitegrid")
fig, ax = plt.subplots(figsize=(10, 10))
sns.countplot(x='Category', data=df_nobel_prizes, hue='Sex', palette={"Male": "lightblue", "Female": "pink"}, ax=ax)
ax.set_title(f'Countplot of the number of Nobel Prizes won per category:')
ax.set_xlabel('Nobel prize category')
ax.set_ylabel('Nobel prize count')
ax.legend(loc='upper right')
Out[26]:
<matplotlib.legend.Legend at 0x7f9bb0f99fd0>
In [70]:
# Nobel prizes won by female laureates per category
sns.set_style("whitegrid")
fig, ax = plt.subplots(figsize=(10, 10))
sns.countplot(x='Category', data=data[(data.Sex == 'Female')], hue='Sex', palette={"Male": "lightblue", "Female": "pink"}, ax=ax)
ax.set_title(f'Countplot of the number of Nobel Prizes won per category:')
ax.set_xlabel('Nobel prize category')
ax.set_ylabel('Nobel prize count')
ax.legend(loc='upper right')
Out[70]:
<matplotlib.legend.Legend at 0x7fd5d9afc160>
In [27]:
# In some categories female laureates are well represented in the last decade
# (df_nobel_prizes is used so that laureates with several affiliations are counted once per prize)
g = sns.catplot(kind='count', data=df_nobel_prizes, x='Decade', hue='Sex', col='Category', col_wrap=3,
                  palette={"Male": "lightblue", "Female": "pink"})
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle(f'Countplot of the number of Nobel Prizes won in history per category:')
Out[27]:
Text(0.5, 0.98, 'Countplot of the number of Nobel Prizes won in history per category:')
In [71]:
# Same plot restricted to female laureates
g = sns.catplot(kind='count', data=data[(data.Sex == 'Female')], x='Decade', hue='Sex', col='Category', col_wrap=3,
                  palette={"Male": "lightblue", "Female": "pink"})
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle(f'Countplot of the number of Nobel Prizes won in history per category:')
Out[71]:
Text(0.5, 0.98, 'Countplot of the number of Nobel Prizes won in history per category:')
In [119]:
g = sns.FacetGrid(df_nobel_prizes, row='Category', height=2, aspect=4)
g.map_dataframe(sns.regplot, x='Year', y='Age', scatter=False, lowess=True, line_kws={'color': 'black'})  # Only Lowess for Male/Female combined
g.map_dataframe(sns.scatterplot, x='Year', y='Age', hue='Age_Group', palette={"Youth": "orange", "Young Adult": "forestgreen", "Adult": "royalblue", "Senior": "lightsteelblue"})
g.add_legend()
Out[119]:
<seaborn.axisgrid.FacetGrid at 0x7fd5db55f4c0>
In [118]:
fig, ax = plt.subplots(figsize=(10, 10))
# Use a separate name so the full df_nobel_prizes is not overwritten for other cells
df_female_prizes = df_nobel_prizes[df_nobel_prizes['Sex'] == 'Female']
sns.regplot(ax=ax, data=df_female_prizes, x='Year', y='Age', scatter=False, lowess=True, line_kws={'color': 'black'})
sns.regplot(ax=ax, data=df_female_prizes[df_female_prizes['Age_Group'] == 'Youth'],
            x='Year', y='Age', lowess=True, fit_reg=False, color="orange")
sns.regplot(ax=ax, data=df_female_prizes[df_female_prizes['Age_Group'] == 'Young Adult'],
            x='Year', y='Age', lowess=True, fit_reg=False, color="forestgreen")
sns.regplot(ax=ax, data=df_female_prizes[df_female_prizes['Age_Group'] == 'Adult'],
            x='Year', y='Age', lowess=True, fit_reg=False, color="royalblue")
sns.regplot(ax=ax, data=df_female_prizes[df_female_prizes['Age_Group'] == 'Senior'],
            x='Year', y='Age', lowess=True, fit_reg=False, color="lightsteelblue")

ax.set_title('Regression plot of Age in relation to Nobel Prizes won by female laureates:')
ax.legend(labels=['Average Age', 'Youth', 'Young Adult', 'Adult', 'Senior'], loc='upper right')
Out[118]:
<matplotlib.legend.Legend at 0x7fd5db538070>
In [168]:
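# Count unique laureates per sex (a laureate with several rows is counted once)
gb_gender = df_nobel.groupby('Sex')['Laureate ID'].nunique()
plt.figure(figsize=(8, 5))
# Take the labels from the grouped index so they always match the wedges
plt.pie(x=gb_gender, labels=gb_gender.index, autopct="%.1f%%")
plt.tight_layout()
plt.show()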
In [165]:
df_nobel.head()
Out[165]:
Year Category Prize Motivation Prize Share Laureate ID Laureate Type Full Name Birth Date Birth City Birth Country Sex Organization Name Organization City Organization Country Death Date Death City Death Country
0 1901 Chemistry The Nobel Prize in Chemistry 1901 "in recognition of the extraordinary services ... 1/1 160 Individual Jacobus Henricus van 't Hoff 1852-08-30 Rotterdam Netherlands Male Berlin University Berlin Germany 1911-03-01 Berlin Germany
1 1901 Literature The Nobel Prize in Literature 1901 "in special recognition of his poetic composit... 1/1 569 Individual Sully Prudhomme 1839-03-16 Paris France Male NaN NaN NaN 1907-09-07 Châtenay France
2 1901 Medicine The Nobel Prize in Physiology or Medicine 1901 "for his work on serum therapy, especially its... 1/1 293 Individual Emil Adolf von Behring 1854-03-15 Hansdorf (Lawice) Prussia (Poland) Male Marburg University Marburg Germany 1917-03-31 Marburg Germany
3 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 462 Individual Jean Henry Dunant 1828-05-08 Geneva Switzerland Male NaN NaN NaN 1910-10-30 Heiden Switzerland
4 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 463 Individual Frédéric Passy 1822-05-20 Paris France Male NaN NaN NaN 1912-06-12 Paris France
In [169]:
plt.figure(figsize=(12,6))
# Count each laureate once per category
gb_cag=df_nobel.groupby('Category')['Laureate ID'].nunique()
plt.subplot(121)
plt.pie(x=gb_cag,autopct="%.1f%%",explode=(0,0.1,0.1,0.1,0,0.1),shadow=True)
# Legend entries come from the index, so all six categories are listed in wedge order
plt.legend(gb_cag.index.str.lower().tolist())
plt.subplot(122)
gb_cag.sort_values().plot(kind='bar')
plt.tight_layout()
plt.show()
In [121]:
df_nobel_prizes = df_nobel.drop_duplicates(subset=['Year', 'Category', 'Laureate ID'])
print(f'Number of (possibly shared) Nobel Prizes handed out between 1901 and 2016: {len(df_nobel_prizes)}')
Number of (possibly shared) Nobel Prizes handed out between 1901 and 2016: 911
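Why deduplicate on `Year`, `Category` and `Laureate ID` rather than on `Laureate ID` alone? A minimal sketch on made-up rows, assuming that duplicate records arise when the same prize is listed more than once (for instance with two affiliations); the organization names and IDs below are hypothetical:

```python
import pandas as pd

# Toy data (hypothetical): laureate 10 is listed twice for the same 1903 prize,
# laureate 20 won two separate prizes in different years.
toy = pd.DataFrame({
    'Year': [1903, 1903, 1911, 1954],
    'Category': ['Physics', 'Physics', 'Chemistry', 'Chemistry'],
    'Laureate ID': [10, 10, 20, 20],
    'Organization Name': ['Org A', 'Org B', 'Org C', 'Org C'],
})

# One row per (year, category, laureate): the double listing collapses,
# while the genuine repeat win is kept.
prizes = toy.drop_duplicates(subset=['Year', 'Category', 'Laureate ID'])
print(len(toy), '->', len(prizes))
```

Deduplicating on `Laureate ID` alone would also discard the second win of laureate 20, which is why the subset includes the year and category.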
In [171]:
import plotly.express as px
In [176]:
organizations = df_nobel['Organization Name'].dropna().unique()
len(organizations)
Out[176]:
315
In [179]:
organization_names = df_nobel.groupby('Organization Name')['Organization Name'].count().reset_index(name = 'count').sort_values(by='count', ascending = False)
fig = px.bar(organization_names[0:16], y='Organization Name', x = 'count', color = 'Organization Name')
fig.show()
In [180]:
organization_names.set_index('Organization Name', inplace=True)
In [181]:
cat_org = df_nobel.groupby(['Organization Name', 'Category'])['Organization Name'].count().reset_index(name = 'count').sort_values(by='count', ascending = False)
cat_org['NumberPerOrganization']=0
for org in organizations:
  cat_org['NumberPerOrganization'] += (cat_org['Organization Name']==org)*organization_names.loc[org, 'count']
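The per-organization total built by the loop can also be obtained without iterating over every organization. A minimal sketch on a toy frame (organization names and counts are made up), using `groupby(...).transform('sum')` to broadcast each organization's total back onto its rows:

```python
import pandas as pd

# Toy data (hypothetical organizations and counts)
cat_org_toy = pd.DataFrame({
    'Organization Name': ['Org A', 'Org A', 'Org B'],
    'Category': ['Physics', 'Chemistry', 'Physics'],
    'count': [3, 2, 4],
})

# Sum the per-category counts within each organization and
# repeat that total on every row of the organization
cat_org_toy['NumberPerOrganization'] = (
    cat_org_toy.groupby('Organization Name')['count'].transform('sum')
)
print(cat_org_toy)
```

This gives the same totals as long as the per-category counts of an organization add up to its overall count, which holds when both are plain row counts of the same frame.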
In [188]:
cat_org.sort_values(by=['NumberPerOrganization', 'Organization Name'], ascending = False, inplace=True) 
In [221]:
fig = px.bar(cat_org[:53], y = 'Organization Name',  x = 'count', color='Category').update_yaxes(categoryorder='total ascending')
fig.show()
In [223]:
import plotly.io as pio
pio.write_html(fig, file= 'index.html', auto_open = True)
In [182]:
fig = px.bar(cat_org, x='count', y = 'Category', color = 'Organization Name')
fig.update_layout(width=1600, height=500)
fig.show()
In [193]:
cat_year = df_nobel.groupby(['Year','Category'])['Category'].count().reset_index(name = 'count')
fig = px.bar(cat_year, x='Year', y = 'count', color = 'Category')
fig.show()
In [195]:
gen_year = df_nobel.groupby(['Year','Sex', 'Category'])['Year'].count().reset_index(name = 'count')
fig = px.bar(gen_year, x='Year', y = 'count', color = 'Sex')
fig.show()
In [196]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
In [200]:
pie_df1960=df_nobel[df_nobel['Year']<=1960]['Sex'].value_counts().reset_index()
pie_df1960.columns=['sex', 'count']
pie_df1961_pr=df_nobel[df_nobel['Year']>1960]['Sex'].value_counts().reset_index()
pie_df1961_pr.columns=['sex', 'count']
fig = make_subplots(1, 2, specs=[[{'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=['1901-1960', '1961-2016'])
# Take the labels from the counted data so each slice is named after its own value
fig.add_trace(go.Pie(labels=pie_df1960['sex'], values=pie_df1960['count'],
                     name='1901-1960'), 1, 1)
fig.add_trace(go.Pie(labels=pie_df1961_pr['sex'], values=pie_df1961_pr['count'],
                     name='1961-2016'), 1, 2)
#fig=px.pie(pie_df1960, values="count", names="sex", title="proportion of genders",color_discrete_sequence=['blue', 'red'])
fig.show()
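To put a number on the comparison between the two periods, the share per sex can be tabulated directly. A minimal sketch on made-up rows standing in for `df_nobel` (the `Period` column is a helper introduced here for illustration), using the same 1960 cut-off:

```python
import pandas as pd

# Toy data (hypothetical rows standing in for df_nobel)
toy = pd.DataFrame({
    'Year': [1903, 1911, 1950, 1964, 1988, 2009, 2014, 2015],
    'Sex': ['Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male'],
})

# Label each row with its period, using the same cut-off as the pie charts
toy['Period'] = toy['Year'].apply(lambda y: '1901-1960' if y <= 1960 else '1961-2016')

# Share of each sex within a period (every row of the table sums to 1)
share = pd.crosstab(toy['Period'], toy['Sex'], normalize='index')
print(share)
```

Applied to the real data, the `Female` column of such a table would show whether the share moved between the two periods.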
In [201]:
categories = ['Chemistry', 'Literature', 'Medicine', 'Peace', 'Physics', 'Economics']

fig = make_subplots(2, 3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]], 
                     vertical_spacing=0.2, horizontal_spacing=0.08, row_heights=[4, 4], subplot_titles=categories)
# One pie per category, filling the 2x3 grid row by row
for i, category in enumerate(categories):
    # Laureates per sex within the category; labels come from the data itself
    counts = df_nobel[df_nobel['Category']==category]['Sex'].value_counts()
    fig.add_trace(go.Pie(labels=counts.index.str.lower(), values=counts.values,
                         name=f'Female in {category}'), i // 3 + 1, i % 3 + 1)
fig.update_traces(textinfo='none')
fig.show()
In [203]:
df_nobel['Birth Date'] = pd.to_datetime(df_nobel['Birth Date'], errors='coerce')

df_nobel['age'] = df_nobel['Year'] - df_nobel['Birth Date'].dt.year

plt.figure(figsize=(15, 7))
sns.swarmplot(x='Sex', y='age',hue = 'Category', dodge=True , data=df_nobel)
plt.ylabel('Age')
plt.xlabel('Gender')
plt.title('Age of every winner, separated by gender and prize category')
plt.show()
In [216]:
# Keep Year as an integer: pd.to_datetime would read it as nanoseconds since 1970,
# and lmplot cannot cast the resulting Timestamp values to float (TypeError).
df_nobel['winning_age'] = df_nobel['Year'] - df_nobel['Birth Date'].dt.year


with sns.axes_style("whitegrid"):
    sns.lmplot(data=df_nobel,
                x='Year',
                y='winning_age',
                hue='Category', 
                lowess=True, 
                aspect=2, 
                scatter_kws = {'alpha': 0.5},
                line_kws = {'linewidth': 5})

plt.title('Laureate Age when awarded')
plt.show()
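A note on why `Year` is best left as an integer for `lmplot`: `pd.to_datetime` interprets bare integers as nanoseconds since the Unix epoch, so 1901 does not become the year 1901, and the resulting `Timestamp` values cannot be cast to float for the regression. A minimal sketch of the pitfall on two sample years:

```python
import pandas as pd

years = pd.Series([1901, 2016])

# Integers are read as nanoseconds since 1970-01-01, not as calendar years
as_epoch = pd.to_datetime(years)

# Passing the year as text with an explicit format gives the intended dates
as_year = pd.to_datetime(years.astype(str), format='%Y')

print(as_epoch.dt.year.tolist())  # both values land in 1970
print(as_year.dt.year.tolist())   # the intended years
```

For a plot of age against year, the integer column is already what is needed; the datetime form is only useful when actual dates are required.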
In [210]:
df_nobel.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 969 entries, 0 to 968
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   Year                  969 non-null    int64         
 1   Category              969 non-null    object        
 2   Prize                 969 non-null    object        
 3   Motivation            881 non-null    object        
 4   Prize Share           969 non-null    object        
 5   Laureate ID           969 non-null    int64         
 6   Laureate Type         969 non-null    object        
 7   Full Name             969 non-null    object        
 8   Birth Date            938 non-null    datetime64[ns]
 9   Birth City            941 non-null    object        
 10  Birth Country         943 non-null    object        
 11  Sex                   943 non-null    object        
 12  Organization Name     722 non-null    object        
 13  Organization City     716 non-null    object        
 14  Organization Country  716 non-null    object        
 15  Death Date            617 non-null    object        
 16  Death City            599 non-null    object        
 17  Death Country         605 non-null    object        
 18  age                   938 non-null    float64       
dtypes: datetime64[ns](1), float64(1), int64(2), object(15)
memory usage: 144.0+ KB
In [102]:
# Define geopandas geometry dataframe:

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world = world[(world.pop_est>0) & (world.name!="Antarctica")]  # Reflect countries with population, leaving Antarctica out

# Define nobel prize dataframe:
nobel = df_nobel_prizes.drop(['Birth Date', 'Death Date'], axis = 1)  # Geopandas conflict with Date format
nobel = nobel[nobel.Sex == 'Female'].copy()  # Keep female laureates only; copy so the new column is added to an independent frame
nobel['Nobel_Country_Count'] = df_nobel_prizes.groupby('Birth Country')['Birth Country'].transform('count')  # Count of Nobel prizes per Birth Country over all laureates, aligned on the index


# Merge geopandas geometry and nobel prize dataframes
df = pd.merge(nobel, world, how='left', left_on='Birth Country', right_on='name').reset_index()
df_gdf = gpd.GeoDataFrame(df)

# Identify countries of birth for which 'geometry' was not merged
countries_not_reflected = df_gdf[df_gdf['geometry'].isna()]['Birth Country'].unique()
print(f'#{len(countries_not_reflected)} countries are not reflected in the world map as the country name differed over time (example: {countries_not_reflected[0]})')

# Plot world map!
fig, ax = plt.subplots(figsize=(30, 15))
ax.set_title(f'Countplot of the number of Nobel Prizes won per country:')
ax.set_axis_off()

# Format legend
divider = make_axes_locatable(ax)
cax = divider.append_axes("left", size="3%", pad=0.1)

world.plot(ax=ax, color='lightgrey')
df_gdf.plot(ax=ax, column='Nobel_Country_Count',cmap='viridis', legend=True, cax=cax)
plt.show()
#9 countries are not reflected in the world map as the country name differed over time (example: Russian Empire (Poland))
In [92]:
nobel['Nobel_Country_Count'].shape
Out[92]:
(49,)

Nobel prizes by Country¶

In [42]:
# Define geopandas geometry dataframe:
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world = world[(world.pop_est>0) & (world.name!="Antarctica")]  # Reflect countries with population, leaving Antarctica out

# Define nobel prize dataframe:
nobel = df_nobel_prizes.drop(['Birth Date', 'Death Date'], axis = 1)  # Geopandas conflict with Date format
nobel['Nobel_Country_Count'] = df_nobel_prizes.groupby('Birth Country')['Birth Country'].transform('count')  # Derives count of Nobel prizes per Birth Country

# Merge geopandas geometry and nobel prize dataframes
df = pd.merge(nobel, world, how='left', left_on='Birth Country', right_on='name').reset_index()
df_gdf = gpd.GeoDataFrame(df)

# Identify countries of birth for which 'geometry' was not merged
countries_not_reflected = df_gdf[df_gdf['geometry'].isna()]['Birth Country'].unique()
print(f'#{len(countries_not_reflected)} countries are not reflected in the world map as the country name differed over time (example: {countries_not_reflected[0]})')

# Plot world map!
fig, ax = plt.subplots(figsize=(20, 10))
ax.set_title(f'Countplot of the number of Nobel Prizes won per country:')
ax.set_axis_off()

# Format legend
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.1)

world.plot(ax=ax, color='lightgrey')
df_gdf.plot(ax=ax, column='Nobel_Country_Count',cmap='viridis', legend=True, cax=cax)
plt.show()
#68 countries are not reflected in the world map as the country name differed over time (example: Prussia (Poland))
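Many of the unmatched names follow the pattern shown in the printed examples, with a historical name followed by another name in parentheses. A minimal sketch of one way to recover more matches, assuming that the parenthesised part is the present-day country name (an assumption based only on those examples); `birth_country_modern` is a helper name introduced here:

```python
import pandas as pd

# Sample values in the style of the 'Birth Country' column
birth_country = pd.Series(['Prussia (Poland)', 'Russian Empire (Poland)', 'France', None])

# Capture the text inside trailing parentheses; rows without them give NaN
modern = birth_country.str.extract(r'\(([^)]+)\)$', expand=False)

# Fall back to the original value where nothing was captured
birth_country_modern = modern.fillna(birth_country)
print(birth_country_modern.tolist())
```

Names extracted this way may still differ from the spelling in the map's `name` column, so the unmatched list is worth checking again after such a step.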
In [29]:
# The top 10 countries where most Nobel prize laureates were born
df = df_nobel_prizes['Birth Country'].value_counts().reset_index()

# Determine medal coloring for top-3 countries with respect to Nobel prize winners
colors = {}
for index, row in df.iterrows():
    n = row['Birth Country']
    if n==df['Birth Country'][0]:
        colors[row['index']] = 'gold'
    elif n==df['Birth Country'][1]:
        colors[row['index']] = 'silver'
    elif n==df['Birth Country'][2]:
        colors[row['index']] = 'darkgoldenrod'
    elif row['index']=='Netherlands':
        colors[row['index']] = 'orange'
    else:
        colors[row['index']] = 'lightblue'
        
fig, ax = plt.subplots(figsize=(10, 10))
sns.barplot(data=df.head(n=10), x='index', y='Birth Country', palette=colors, ax=ax)
ax.set_title(f'Countplot of the number of Nobel Prizes won by Birth Country:')
ax.set_ylabel('Nobel prize count')
ax.set_xticklabels(ax.get_xticklabels(), rotation=-25)
Out[29]:
[Text(0, 0, 'United States of America'),
 Text(1, 0, 'United Kingdom'),
 Text(2, 0, 'Germany'),
 Text(3, 0, 'France'),
 Text(4, 0, 'Sweden'),
 Text(5, 0, 'Japan'),
 Text(6, 0, 'Canada'),
 Text(7, 0, 'Netherlands'),
 Text(8, 0, 'Italy'),
 Text(9, 0, 'Russia')]
In [43]:
def country_rank(df, country):
    """Prints the rank of `country` by number of Nobel prizes per 'Birth Country'"""
    countries = df['Birth Country'].unique()
    # Countries sorted by prize count, with explicit column names
    counts = df['Birth Country'].value_counts().rename_axis('Country').reset_index(name='Count')
    match = counts[counts['Country'] == country]  # Index number +1 is rank
    if len(match) == 1:
        i = match.index[0]
        n = match['Count'].iloc[0]
        print(f'Country: {country} is ranked at place {i + 1} with #{n} Nobel prize winners')
    else:
        print(f'No valid records retrieved for: {country}, please submit any of the following countries: {countries}')

# How does your country rank with respect to Nobel prize laureates?
country_rank(df_nobel_prizes, country='Netherlands')
Country: Netherlands is ranked at place 8 with #18 Nobel prize winners
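A rank taken from the row position in `value_counts` breaks ties arbitrarily: two countries with the same count get different places. A minimal sketch of a tie-aware alternative on made-up data, using `Series.rank(method='min')` so that tied countries share the best place:

```python
import pandas as pd

# Toy data (hypothetical): one entry per prize
toy = pd.Series(['USA', 'USA', 'USA', 'UK', 'UK', 'Germany', 'Germany', 'France'],
                name='Birth Country')

counts = toy.value_counts()

# Highest count gets rank 1; countries with equal counts share a rank
ranks = counts.rank(method='min', ascending=False).astype(int)
print(ranks)
```

Here the two countries with two prizes both rank second, and the next country ranks fourth.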

Nobel prizes by Age¶

In [31]:
fig, ax = plt.subplots(figsize=(10, 10))
sns.regplot(ax=ax, data=df_nobel_prizes, x='Year', y='Age', scatter=False, lowess=True, line_kws={'color': 'black'})
sns.regplot(ax=ax, data=df_nobel_prizes[df_nobel_prizes['Age_Group'] == 'Youth']
            , x='Year', y='Age', lowess=True, fit_reg=False, color="orange")
sns.regplot(ax=ax, data=df_nobel_prizes[df_nobel_prizes['Age_Group'] == 'Young Adult']
            , x='Year', y='Age', lowess=True, fit_reg=False, color="forestgreen")
sns.regplot(ax=ax, data=df_nobel_prizes[df_nobel_prizes['Age_Group'] == 'Adult']
            , x='Year', y='Age', lowess=True, fit_reg=False, color="royalblue")
sns.regplot(ax=ax, data=df_nobel_prizes[df_nobel_prizes['Age_Group'] == 'Senior']
            , x='Year', y='Age', lowess=True, fit_reg=False, color="lightsteelblue")

ax.set_title(f'Regression plot of Age in relation to Nobel Prizes won in history:')
ax.legend(labels=['Average Age', 'Youth', 'Young Adult', 'Adult', 'Senior'], loc='upper right')
Out[31]:
<matplotlib.legend.Legend at 0x7f9b7016b250>
In [32]:
# The trend in age is clearly increasing for nobel prize winners, though we see some differences across the prize categories
g = sns.FacetGrid(df_nobel_prizes, row='Category', height=2, aspect=4)
g.map_dataframe(sns.regplot, x='Year', y='Age', scatter=False, lowess=True, line_kws={'color': 'black'})  # Only Lowess for Male/Female combined
g.map_dataframe(sns.scatterplot, x='Year', y='Age', hue='Age_Group', palette={"Youth": "orange", "Young Adult": "forestgreen", "Adult": "royalblue", "Senior": "lightsteelblue"})
g.add_legend()
Out[32]:
<seaborn.axisgrid.FacetGrid at 0x7f9bc02772e0>
In [33]:
# The oldest winner of a Nobel Prize as of 2016
df_nobel_prizes.nlargest(1, "Age")
Out[33]:
Year Category Prize Motivation Prize Share Laureate ID Laureate Type Full Name Birth Date Birth City ... Organization Name Organization City Organization Country Death Date Death City Death Country Decade Age Age_Group Generation
825 2007 Economics The Sveriges Riksbank Prize in Economic Scienc... "for having laid the foundations of mechanism ... 1/3 820 Individual Leonid Hurwicz 1917-08-21 Moscow ... University of Minnesota Minneapolis, MN United States of America 2008-06-24 Minneapolis, MN United States of America 2000 90.0 Senior Greatest Generation

1 rows × 22 columns

In [34]:
# The youngest winner of a Nobel Prize as of 2016
df_nobel_prizes.nsmallest(1, "Age")
Out[34]:
Year Category Prize Motivation Prize Share Laureate ID Laureate Type Full Name Birth Date Birth City ... Organization Name Organization City Organization Country Death Date Death City Death Country Decade Age Age_Group Generation
940 2014 Peace The Nobel Peace Prize 2014 "for their struggle against the suppression of... 1/2 914 Individual Malala Yousafzai 1997-07-12 Mingora ... NaN NaN NaN NaN NaN NaN 2010 17.0 Youth Generation Z

1 rows × 22 columns

Nobel prizes by Generation¶

In [35]:
# Let's also have a look at the Nobel laureates per generation
fig, ax = plt.subplots(figsize=(10, 10))
sns.boxplot(ax=ax, data=df_nobel_prizes, x='Year', y='Generation',fliersize=0, palette="hls")
sns.stripplot(ax=ax, data=df_nobel_prizes, x='Year', y='Generation', palette="hls")
ax.set_title(f'Boxplot of Generation in relation to Nobel Prizes won in history:')
Out[35]:
Text(0.5, 1.0, 'Boxplot of Generation in relation to Nobel Prizes won in history:')

Repeat laureates¶

In [36]:
repeat_laureates = df_nobel_prizes.groupby('Full Name').filter(lambda winner: len(winner) > 1)
display(repeat_laureates[['Full Name', 'Birth Country', 'Laureate Type']].value_counts().to_frame())
0
Full Name Birth Country Laureate Type
Frederick Sanger United Kingdom Individual 2
John Bardeen United States of America Individual 2
Linus Carl Pauling United States of America Individual 2
Marie Curie, née Sklodowska Russian Empire (Poland) Individual 2
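The table lists individuals only. `DataFrame.value_counts` drops rows that contain missing values by default, so any repeat laureate without a `Birth Country` (an organization, for example) would not appear. A minimal sketch on made-up rows that counts wins per laureate without involving the birth country:

```python
import pandas as pd

# Toy data (hypothetical IDs and names): one individual with two wins,
# one organization with three wins and no birth country, one single winner
toy = pd.DataFrame({
    'Laureate ID': [6, 6, 482, 482, 482, 160],
    'Full Name': ['Laureate A', 'Laureate A', 'Organization B',
                  'Organization B', 'Organization B', 'Laureate C'],
    'Laureate Type': ['Individual', 'Individual', 'Organization',
                      'Organization', 'Organization', 'Individual'],
    'Birth Country': ['Country X', 'Country X', None, None, None, 'Country Y'],
})

# Wins per laureate, grouped only on columns that are never missing
wins = toy.groupby(['Laureate ID', 'Full Name', 'Laureate Type']).size()
repeat = wins[wins > 1].sort_values(ascending=False)
print(repeat)
```

Grouping on `Laureate ID` also avoids merging two different laureates who happen to share a name.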
In [62]:
plt.figure(figsize= (30,20))
sns.swarmplot(y ="Category", x = "Year", data = df_nobel, hue = "Sex",)
plt.suptitle("Gender distribution of prize winners by year and category", fontsize = 20)

sns.despine(top = True, right = True, left = False, bottom = False)


plt.show()
In [57]:
plt.figure(figsize= (18,5))
sns.swarmplot(y ="Category", x = "Year", data = data[(data.Category == 'Chemistry')], hue = "Sex",)
plt.suptitle("Gender distribution of prize winners by year and category of Chemistry", fontsize = 20)

sns.despine(top = True, right = True, left = False, bottom = False)


plt.show()
In [56]:
plt.figure(figsize= (18,5))
sns.swarmplot(y ="Category", x = "Year", data = data[(data.Category == 'Literature')], hue = "Sex",)
plt.suptitle("Gender distribution of prize winners by year and category of Literature", fontsize = 20)

sns.despine(top = True, right = True, left = False, bottom = False)


plt.show()
In [55]:
plt.figure(figsize= (18,5))
sns.swarmplot(y ="Category", x = "Year", data = data[(data.Category == 'Medicine')], hue = "Sex",)
plt.suptitle("Gender distribution of prize winners by year and category of Medicine", fontsize = 20)

sns.despine(top = True, right = True, left = False, bottom = False)


plt.show()
In [58]:
plt.figure(figsize= (18,5))
sns.swarmplot(y ="Category", x = "Year", data = data[(data.Category == 'Peace')], hue = "Sex",)
plt.suptitle("Gender distribution of prize winners by year and category of Peace", fontsize = 20)

sns.despine(top = True, right = True, left = False, bottom = False)


plt.show()
In [59]:
plt.figure(figsize= (18,5))
sns.swarmplot(y ="Category", x = "Year", data = data[(data.Category == 'Physics')], hue = "Sex",)
plt.suptitle("Gender distribution of prize winners by year and category of Physics", fontsize = 20)

sns.despine(top = True, right = True, left = False, bottom = False)


plt.show()
In [60]:
plt.figure(figsize= (18,5))
sns.swarmplot(y ="Category", x = "Year", data = data[(data.Category == 'Economics')], hue = "Sex",)
plt.suptitle("Gender distribution of prize winners by year and category of Economics", fontsize = 20)

sns.despine(top = True, right = True, left = False, bottom = False)


plt.show()
In [65]:
plt.figure(figsize= (30,20))
sns.swarmplot(y ="Category", x = "Year", data = data[(data.Sex == 'Female')], hue = "Sex",)
plt.suptitle("Gender distribution of prize winners by year and category only Female", fontsize = 20)

sns.despine(top = True, right = True, left = False, bottom = False)


plt.show()
In [45]:
plt.figure(figsize= (30,10))
sns.kdeplot(
   data=df, x="Age", hue="Category",
   fill=True, common_norm=False, palette="Paired",
   alpha=.5, linewidth=0,
)
sns.despine(top = True, right = True, left = False, bottom = False)
plt.suptitle("Age distribution of prize winners by category", fontsize = 20)
plt.show()
In [66]:
pip install bar_chart_race
Collecting bar_chart_race
  Downloading bar_chart_race-0.1.0-py3-none-any.whl (156 kB)
     |████████████████████████████████| 156 kB 356 kB/s eta 0:00:01
Requirement already satisfied: matplotlib>=3.1 in /opt/anaconda3/lib/python3.9/site-packages (from bar_chart_race) (3.4.3)
Requirement already satisfied: pandas>=0.24 in /opt/anaconda3/lib/python3.9/site-packages (from bar_chart_race) (1.3.4)
Requirement already satisfied: pyparsing>=2.2.1 in /opt/anaconda3/lib/python3.9/site-packages (from matplotlib>=3.1->bar_chart_race) (3.0.4)
Requirement already satisfied: cycler>=0.10 in /opt/anaconda3/lib/python3.9/site-packages (from matplotlib>=3.1->bar_chart_race) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/anaconda3/lib/python3.9/site-packages (from matplotlib>=3.1->bar_chart_race) (1.3.1)
Requirement already satisfied: pillow>=6.2.0 in /opt/anaconda3/lib/python3.9/site-packages (from matplotlib>=3.1->bar_chart_race) (8.4.0)
Requirement already satisfied: numpy>=1.16 in /opt/anaconda3/lib/python3.9/site-packages (from matplotlib>=3.1->bar_chart_race) (1.19.2)
Requirement already satisfied: python-dateutil>=2.7 in /opt/anaconda3/lib/python3.9/site-packages (from matplotlib>=3.1->bar_chart_race) (2.8.2)
Requirement already satisfied: six in /opt/anaconda3/lib/python3.9/site-packages (from cycler>=0.10->matplotlib>=3.1->bar_chart_race) (1.16.0)
Requirement already satisfied: pytz>=2017.3 in /opt/anaconda3/lib/python3.9/site-packages (from pandas>=0.24->bar_chart_race) (2021.3)
Installing collected packages: bar-chart-race
Successfully installed bar-chart-race-0.1.0
Note: you may need to restart the kernel to use updated packages.
In [67]:
df=df_nobel
import bar_chart_race as bcr
bcr.bar_chart_race(
    df=df,
    filename='nobel_prize.mp4',
    orientation='h',
    sort='desc',
    n_bars=6,
    fixed_order=False,
    fixed_max=True,
    steps_per_period=10,
    interpolate_period=False,
    label_bars=True,
    bar_size=.95,
    period_label={'x': .99, 'y': .25, 'ha': 'right', 'va': 'center'},
    period_fmt='%B %d, %Y',
    period_summary_func=lambda v, r: {'x': .99, 'y': .18,
                                      's': f'Countries: {v.nlargest(6).sum():,.0f}',
                                      'ha': 'right', 'size': 8, 'family': 'Courier New'},
    perpendicular_bar_func='median',
    period_length=500,
    figsize=(5, 3),
    dpi=144,
    cmap='dark12',
    title='Wins by Country',
    title_size='',
    bar_label_size=7,
    tick_label_size=7,
    shared_fontdict={'family' : 'Helvetica', 'color' : '.1'},
    scale='linear',
    writer=None,
    fig=None,
    bar_kwargs={'alpha': .7},
    filter_column_colors=False)
/opt/anaconda3/lib/python3.9/site-packages/bar_chart_race/_make_chart.py:278: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  max_val = self.df_values.max().max()
/opt/anaconda3/lib/python3.9/site-packages/bar_chart_race/_make_chart.py:286: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set_yticklabels(self.df_values.columns)
/opt/anaconda3/lib/python3.9/site-packages/bar_chart_race/_make_chart.py:287: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set_xticklabels([max_val] * len(ax.get_xticks()))
/opt/anaconda3/lib/python3.9/site-packages/bar_chart_race/_make_chart.py:251: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  ax.set_xlim(min_val, self.df_values.max().max() * 1.05 * 1.11)
MovieWriter ffmpeg unavailable; using Pillow instead.
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/opt/anaconda3/lib/python3.9/site-packages/matplotlib/animation.py in saving(self, fig, outfile, dpi, *args, **kwargs)
    235         try:
--> 236             yield self
    237         finally:

/opt/anaconda3/lib/python3.9/site-packages/matplotlib/animation.py in save(self, filename, writer, fps, dpi, codec, bitrate, extra_args, metadata, extra_anim, savefig_kwargs, progress_callback)
   1159             for anim in all_anim:
-> 1160                 anim._init_draw()  # Clear the initial frame
   1161             frame_number = 0

/opt/anaconda3/lib/python3.9/site-packages/matplotlib/animation.py in _init_draw(self)
   1755         else:
-> 1756             self._drawn_artists = self._init_func()
   1757             if self._blit:

/opt/anaconda3/lib/python3.9/site-packages/bar_chart_race/_make_chart.py in init_func()
    419         def init_func():
--> 420             self.plot_bars(0)
    421 

/opt/anaconda3/lib/python3.9/site-packages/bar_chart_race/_make_chart.py in plot_bars(self, i)
    325         bar_location = bar_location[top_filt]
--> 326         bar_length = self.df_values.iloc[i].values[top_filt]
    327         cols = self.df_values.columns[top_filt]

IndexError: boolean index did not match indexed array along dimension 0; dimension is 18 but corresponding boolean dimension is 2

During handling of the above exception, another exception occurred:

IndexError                                Traceback (most recent call last)
/opt/anaconda3/lib/python3.9/site-packages/bar_chart_race/_make_chart.py in make_animation(self)
    434             else:
--> 435                 ret_val = anim.save(self.filename, fps=self.fps, writer=self.writer)
    436         except Exception as e:

/opt/anaconda3/lib/python3.9/site-packages/matplotlib/animation.py in save(self, filename, writer, fps, dpi, codec, bitrate, extra_args, metadata, extra_anim, savefig_kwargs, progress_callback)
   1176                         frame_number += 1
-> 1177                 writer.grab_frame(**savefig_kwargs)
   1178 

/opt/anaconda3/lib/python3.9/contextlib.py in __exit__(self, typ, value, traceback)
    136             try:
--> 137                 self.gen.throw(typ, value, traceback)
    138             except StopIteration as exc:

/opt/anaconda3/lib/python3.9/site-packages/matplotlib/animation.py in saving(self, fig, outfile, dpi, *args, **kwargs)
    237         finally:
--> 238             self.finish()
    239 

/opt/anaconda3/lib/python3.9/site-packages/matplotlib/animation.py in finish(self)
    539     def finish(self):
--> 540         self._frames[0].save(
    541             self.outfile, save_all=True, append_images=self._frames[1:],

IndexError: list index out of range

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
/var/folders/m1/8dggmjzn5tq4bqm0wf1dq92c0000gn/T/ipykernel_3016/4108323816.py in <module>
      1 df=df_nobel
      2 import bar_chart_race as bcr
----> 3 bcr.bar_chart_race(
      4     df=df,
      5     filename='nobel_prize.mp4',

/opt/anaconda3/lib/python3.9/site-packages/bar_chart_race/_make_chart.py in bar_chart_race(df, filename, orientation, sort, n_bars, fixed_order, fixed_max, steps_per_period, period_length, interpolate_period, label_bars, bar_size, period_label, period_fmt, period_summary_func, perpendicular_bar_func, figsize, cmap, title, title_size, bar_label_size, tick_label_size, shared_fontdict, scale, writer, fig, dpi, bar_kwargs, filter_column_colors)
    781                         figsize, cmap, title, title_size, bar_label_size, tick_label_size,
    782                         shared_fontdict, scale, writer, fig, dpi, bar_kwargs, filter_column_colors)
--> 783     return bcr.make_animation()
    784 
    785 def load_dataset(name='covid19'):

/opt/anaconda3/lib/python3.9/site-packages/bar_chart_race/_make_chart.py in make_animation(self)
    444             else:
    445                 message = str(e)
--> 446             raise Exception(message)
    447         finally:
    448             plt.rcParams = self.orig_rcParams

Exception: You do not have ffmpeg installed on your machine. Download
                            ffmpeg from here: https://www.ffmpeg.org/download.html.
                            
                            Matplotlib's original error message below:

                            list index out of range
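Independently of the ffmpeg message, the call above passes `df_nobel` as-is, which is a long-format table with mostly text columns. `bar_chart_race` expects wide-format data: one row per period, one column per bar, numeric values. A minimal sketch of that reshaping with pandas alone, on made-up rows standing in for the prize table (`wins_by_country` is a name introduced here):

```python
import pandas as pd

# Toy data (hypothetical rows): one row per prize
toy = pd.DataFrame({
    'Year': [1901, 1901, 1902, 1903, 1903],
    'Birth Country': ['France', 'Germany', 'Germany', 'France', 'France'],
})

# Prizes per year and country: one column per country, 0 where none were won
per_year = pd.crosstab(toy['Year'], toy['Birth Country'])

# Running total over time, the quantity a 'Wins by Country' race would animate
wins_by_country = per_year.cumsum()
print(wins_by_country)
```

A frame shaped like `wins_by_country` is the kind of input the `df` argument is meant to receive; the formatting arguments (for example `period_fmt`) would still need to suit an integer year index.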
                            
In [ ]:
# Full code:
import pandas as pd
import bar_chart_race as bcr

# open csv file from Johns Hopkins University
df = pd.read_csv('time_series_covid19_confirmed_global.csv')

# remove longitude and latitude values
df = df.drop(columns=["Lat","Long"])

# combine Province/State and Country/Region and make a new column called Location
df['Location'] = df[['Province/State','Country/Region']].apply(lambda x: ', '.join(x.dropna()),axis=1)

# remove Province/State and Country/Region columns
df = df.drop(columns=['Province/State', 'Country/Region'])

# move the combined values to the first column
cols = list(df.columns)
cols = [cols[-1]] + cols[:-1]
df = df[cols]

# transpose the dataframe by flipping the columns and rows
df_transposed = df.T
df_transposed.columns = df_transposed.iloc[0].to_list()
df_transposed = df_transposed.iloc[1:]
df_transposed

# label the index as "Date"
df_transposed.index.names = ['Date']

# specify the locations to be included
cols = ['Hubei, China','Germany','Spain','United Kingdom','US','India', 'Brazil','Russia','France','Italy']
subset = df_transposed[cols]

# make sure all the cells are numeric; the Johns Hopkins series already holds
# cumulative counts, so no extra cumsum is applied here
cum_sum_df = subset.apply(pd.to_numeric)

# turn index to datetime objects
cum_sum_df.index = pd.to_datetime(cum_sum_df.index)

# plot the bar chart race
bcr.bar_chart_race(
    df=cum_sum_df,
    title="COVID-19 Cases by Country",
    filename="covid-19-visualization.mp4",
    period_fmt="%b %-d, %Y",
    n_bars=8,
    steps_per_period=100,
    interpolate_period=True,
)
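
The reshaping steps in the cell above (wide file with one column per date, turned into one row per date and one column per location) can be illustrated on a toy frame. This is only a sketch: the values and the location names `A` and `B` are made up, and it sets `Location` as the index before transposing, which is a variation on the cell above that keeps the counts numeric.

```python
import pandas as pd

# Toy frame in the same wide layout as the csv file (values are invented).
wide = pd.DataFrame({
    "Location": ["A", "B"],
    "1/22/20": [1, 0],
    "1/23/20": [3, 2],
})

# One row per date, one column per location.
long = wide.set_index("Location").T

# Turn the date labels into datetime objects and name the index.
long.index = pd.to_datetime(long.index, format="%m/%d/%y")
long.index.name = "Date"

print(long)
```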
In [68]:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/var/folders/m1/8dggmjzn5tq4bqm0wf1dq92c0000gn/T/ipykernel_3016/3808815877.py in <module>
----> 1 bin/bash(-c, "$(curl, -fsSL, https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)")

NameError: name 'bash' is not defined
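
The NameError above comes from a shell command being parsed as Python. In a notebook, a shell command is normally prefixed with `!` or run in a terminal instead; the Homebrew installer may also prompt for input, so a terminal is probably the simpler route. For commands that do not need input, a hedged sketch using the standard library is shown here; `run_command` is a hypothetical helper, and the demonstration only calls the current Python interpreter rather than the installer:

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of arguments; return (exit code, stdout)."""
    completed = subprocess.run(args, capture_output=True, text=True)
    return completed.returncode, completed.stdout


# Harmless demonstration: ask the current interpreter to print one line.
code, out = run_command([sys.executable, "-c", "print('hello from a subprocess')"])
print(code, out.strip())
```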
In [134]:
USAData = data[data['Birth Country']=='United States of America']
USAData.head()
Out[134]:
 | Year | Category | Prize | Motivation | Prize Share | Laureate ID | Laureate Type | Full Name | Birth Date | Birth City | ... | Organization Name | Organization City | Organization Country | Death Date | Death City | Death Country | Decade | Age | Age_Group | Generation
35 | 1906 | Peace | The Nobel Peace Prize 1906 | NaN | 1/1 | 470 | Individual | Theodore Roosevelt | 1858-10-27 | New York, NY | ... | NaN | NaN | NaN | 1919-01-06 | Oyster Bay, NY | United States of America | 1900 | 48.0 | Adult | Ancient
73 | 1912 | Peace | The Nobel Peace Prize 1912 | NaN | 1/1 | 480 | Individual | Elihu Root | 1845-02-15 | Clinton, NY | ... | NaN | NaN | NaN | 1937-02-07 | New York, NY | United States of America | 1910 | 67.0 | Senior | Ancient
80 | 1914 | Chemistry | The Nobel Prize in Chemistry 1914 | "in recognition of his accurate determinations... | 1/1 | 175 | Individual | Theodore William Richards | 1868-01-31 | Germantown, PA | ... | Harvard University | Cambridge, MA | United States of America | 1928-04-02 | Cambridge, MA | United States of America | 1910 | 46.0 | Adult | Ancient
96 | 1919 | Peace | The Nobel Peace Prize 1919 | NaN | 1/1 | 483 | Individual | Thomas Woodrow Wilson | 1856-12-28 | Staunton, VA | ... | NaN | NaN | NaN | 1924-02-03 | Washington, DC | United States of America | 1910 | 63.0 | Adult | Ancient
118 | 1923 | Physics | The Nobel Prize in Physics 1923 | "for his work on the elementary charge of elec... | 1/1 | 28 | Individual | Robert Andrews Millikan | 1868-03-22 | Morrison, IL | ... | California Institute of Technology (Caltech) | Pasadena, CA | United States of America | 1953-12-19 | San Marino, CA | United States of America | 1920 | 55.0 | Adult | Ancient

5 rows × 22 columns

In [139]:
import altair as alt
from vega_datasets import data

counties = alt.topo_feature(data.us_10m.url, USAData['Birth City'])
source = USAData

alt.Chart(counties).mark_geoshape().encode(
    color='rate:Q'
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(source, 'id', ['rate'])
).project(
    type='albersUsa'
).properties(
    width=500,
    height=300
)
---------------------------------------------------------------------------
SchemaValidationError                     Traceback (most recent call last)
/var/folders/m1/8dggmjzn5tq4bqm0wf1dq92c0000gn/T/ipykernel_3016/2890461058.py in <module>
      2 from vega_datasets import data
      3 
----> 4 counties = alt.topo_feature(data.us_10m.url, USAData['Birth City'])
      5 source = USAData
      6 

/opt/anaconda3/lib/python3.9/site-packages/altair/vegalite/v4/api.py in topo_feature(url, feature, **kwargs)
   2465     """
   2466     return core.UrlData(
-> 2467         url=url, format=core.TopoDataFormat(type="topojson", feature=feature, **kwargs)
   2468     )
   2469 

/opt/anaconda3/lib/python3.9/site-packages/altair/vegalite/v4/schema/core.py in __init__(self, feature, mesh, parse, type, **kwds)
  18339 
  18340     def __init__(self, feature=Undefined, mesh=Undefined, parse=Undefined, type=Undefined, **kwds):
> 18341         super(TopoDataFormat, self).__init__(feature=feature, mesh=mesh, parse=parse, type=type, **kwds)
  18342 
  18343 

/opt/anaconda3/lib/python3.9/site-packages/altair/vegalite/v4/schema/core.py in __init__(self, *args, **kwds)
   3563 
   3564     def __init__(self, *args, **kwds):
-> 3565         super(DataFormat, self).__init__(*args, **kwds)
   3566 
   3567 

/opt/anaconda3/lib/python3.9/site-packages/altair/utils/schemapi.py in __init__(self, *args, **kwds)
    175 
    176         if DEBUG_MODE and self._class_is_valid_at_instantiation:
--> 177             self.to_dict(validate=True)
    178 
    179     def copy(self, deep=True, ignore=()):

/opt/anaconda3/lib/python3.9/site-packages/altair/utils/schemapi.py in to_dict(self, validate, ignore, context)
    338                 self.validate(result)
    339             except jsonschema.ValidationError as err:
--> 340                 raise SchemaValidationError(self, err)
    341         return result
    342 

SchemaValidationError: Invalid specification

        altair.vegalite.v4.schema.core.TopoDataFormat->feature, validating 'type'

        {35: 'New York, NY', 73: 'Clinton, NY', 80: 'Germantown, PA', 96: 'Staunton, VA', 118: 'Morrison, IL', 125: 'Marietta, OH', 139: 'Wooster, OH', 150: 'Potsdam, NY', 153: 'Sauk Centre, MN', 163: 'Cedarville, IL', 164: 'Elizabeth, NJ', 165: 'Brooklyn, NY', 171: 'Lexington, KY', 175: 'Walkerton, IN', 177: 'Ashland, NH', 178: 'Boston, MA', 179: 'Stoughton, WI', 180: 'Stoughton, WI', 189: 'New York, NY', 194: 'New York, NY', 200: 'Bloomington, IL', 204: 'Hillsboro, WV', 213: 'Canton, SD', 216: 'Hume, IL', 220: 'San Francisco, CA', 221: 'Platteville, WI', 229: 'Olympus, TN', 231: 'Canton, MA', 232: 'Yonkers, NY', 233: 'Ridgeville, IN', 235: 'New York, NY', 236: 'Jamaica Plain, MA (Boston)', 237: 'Livingston Manor, NY', 238: 'Cambridge, MA', 248: 'St. Louis, MO', 252: 'New Albany, MS', 262: 'South Norwalk, CT', 264: 'Pittsburgh, PA', 265: 'Detroit, MI', 267: 'Redondo Beach, CA', 268: 'Ishpeming, MI', 280: 'Taylorville, IL', 287: 'Uniontown, PA', 289: 'Portland, OR', 290: 'Oak Park, IL', 291: 'West Hartford, CT', 292: 'West Hartford, CT', 293: 'Ann Arbor, MI', 294: 'Auburn, AL', 299: 'Chicago, IL', 302: 'Los Angeles, CA', 310: 'Orange, NJ', 312: 'Madison, WI', 322: 'Wahoo, NE', 323: 'Boulder, CO', 324: 'Montclair, NJ', 334: 'Brooklyn, NY', 337: 'San Francisco, CA', 338: 'Grand Valley, CO', 343: 'Cleveland, OH', 344: 'St. 
Paul, MN', 348: 'New York, NY', 353: 'Salinas, CA', 355: 'Chicago, IL', 357: 'Portland, OR', 374: 'Atlanta, GA', 375: 'Greenville, SC', 378: 'Boston, MA', 385: 'New York, NY', 386: 'New York, NY', 387: 'Newburyport, MA', 390: 'Baltimore, MD', 398: 'Bloomsburg, PA', 399: 'New York, NY', 403: 'Urbana, IL', 405: 'New York, NY', 407: 'San Francisco, CA', 414: 'Owosso, MI', 417: 'New York, NY', 419: 'Gary, IN', 423: 'New York, NY', 424: 'Cresco, IA', 430: 'Burlingame, KS', 433: 'Monessen, PA', 434: 'Chicago, IL', 435: 'New York, NY', 437: 'New York, NY', 439: 'New York, NY', 441: 'Madison, WI', 442: 'New York, NY', 443: 'Oak Park, IL', 456: 'Sterling, IL', 474: 'New York, NY', 476: 'Philadelphia, PA', 479: 'Chicago, IL', 480: 'Council, ID', 481: 'Cleveland, OH', 482: 'Brooklyn, NY', 484: 'New York, NY', 485: 'Yonkers, NY', 488: 'Brooklyn, NY', 489: 'Ann Arbor, MI', 497: 'New York, NY', 499: 'Indianapolis, IN', 501: 'Middletown, CT', 503: 'Milwaukee, WI', 506: 'Wilmington, DE', 507: 'New York, NY', 512: 'Houston, TX', 515: 'Arlington, SD', 521: 'New York, NY', 524: 'New York, NY', 526: 'New York, NY', 527: 'Boston, MA', 528: 'Omaha, NE', 532: 'Bradford, MA', 534: 'Chicago, IL', 535: 'Merriman, NE', 538: 'Champaign, IL', 540: 'Hartford, CT', 545: 'Mount Verno, NY', 548: 'Renton, WA', 555: 'Waltham, MA', 559: 'Hartford, CT', 562: 'Pittsburgh, PA', 563: 'Fort Worth, TX', 572: 'New York, NY', 573: 'New York, NY', 576: 'New York, NY', 577: 'Sumter, SC', 580: 'San José, CA', 583: 'Murfreesboro, TN', 585: 'Brooklyn, NY', 591: 'Chester, VT', 595: 'Brooklyn, NY', 608: 'New York, NY', 609: 'Hoquiam, WA', 611: 'New York, NY', 612: 'New York, NY', 615: 'Chicago, IL', 618: 'York, PA', 619: 'Oceanside, NY', 621: 'Washington, DC', 624: 'Methuen, MA', 625: 'Chicago, IL', 626: 'Boston, MA', 627: 'Boston, MA', 629: 'Milford, MA', 630: 'Mart, TX', 632: 'Chicago, IL', 633: 'Boston, MA', 643: 'Pottsville, PA', 646: 'Lansing, IA', 650: 'Lenoir, NC', 652: 'New York, NY', 653: 'Cambridge, MA', 
654: 'Lorain, OH', 656: 'Falmouth, KY', 659: 'New York, NY', 660: 'Philadelphia, PA', 663: 'Bluefield, WV', 666: 'New Haven, CT', 667: 'Baltimore, MD', 672: 'Pittsburgh, PA', 675: 'Delaware, OH', 676: 'Yakima, WA', 678: 'Wilkes-Barre, PA', 680: 'South Bend, IN', 683: 'New York, NY', 684: 'Paterson, NJ', 685: 'Alice, TX', 687: 'Akron, OH', 695: 'Rye, NY', 696: 'Aberdeen, WA', 697: 'Washington, DC', 698: 'Provo, UT', 701: 'New York, NY', 704: 'Des Moines, IA', 706: 'Putney, VT', 707: 'St. Louis, MO', 710: 'Wilkes-Barre, PA', 715: 'Charleston, SC', 716: 'Brooklyn, NY', 717: 'Whiting, IN', 720: 'Visalia, CA', 731: 'Sioux City, IA', 734: 'Chicago, IL', 735: 'Raleigh, NC', 738: 'New York, NY', 743: 'Jefferson City, MO', 744: 'Taunton, MA', 746: 'Philadelphia, PA', 747: 'New Haven, CT', 748: 'Montclair, NJ', 749: 'Gary, IN', 751: 'Los Angeles, CA', 756: 'Palo Alto, CA', 758: 'Corvallis, OR', 759: 'New York, NY', 764: 'Wichita, KS', 767: 'Chicago, IL', 769: 'Plains, GA', 770: 'Washington, DC', 773: 'Northfield, MN', 774: 'Burlington, MA', 775: 'Burlington, MA', 776: 'Syracuse, NY', 779: 'Sidney, OH', 787: 'Brooklyn, NY', 790: 'Glens Falls, NY', 791: 'Glens Falls, NY', 793: 'New York, NY', 794: 'Seattle, WA', 796: 'Washington, DC', 797: 'New York, NY', 798: 'New York, NY', 800: 'Possum Trot, KY', 801: 'Berne, IN', 803: 'Oakland, CA', 810: 'New York, NY', 811: 'Denver, CO', 812: 'Denver, CO', 815: 'St. 
Louis, MO', 816: 'Evanston, IL', 818: 'Stanford, CA', 819: 'New Haven, CT', 822: 'Roanoke, VA', 823: 'Yukon, FL', 826: 'New York, NY', 827: 'Boston, MA', 834: 'Washington, DC', 840: 'Chicago, IL', 841: 'New York, NY', 842: 'New York, NY', 843: 'New York, NY', 854: 'Milwaukee, WI', 855: 'Milwaukee, WI', 857: 'Los Angeles, CA', 858: 'Los Angeles, CA', 859: 'Superior, WI', 862: 'San Diego, CA', 866: 'Honolulu, HI', 870: 'White Plains, NY', 871: 'Springfield, MA', 874: 'New York, NY', 875: 'Enterprise, OR', 876: 'Enterprise, OR', 884: 'Pasadena, CA', 885: 'Washington, DC', 887: 'Chicago, IL', 888: 'Chicago, IL', 894: 'Champaign-Urbana, IL', 895: 'Champaign-Urbana, IL', 896: 'Missoula, MT', 897: 'Washington, DC', 898: 'Washington, DC', 899: 'New York, NY', 900: 'New York, NY', 901: 'Little Falls, MN', 902: 'New York, NY', 903: 'New York, NY', 904: 'Cambridge, MA', 912: 'Milwaukee, WI', 913: 'Milwaukee, WI', 918: 'Boston, MA', 919: 'Urbana, IL', 920: 'Detroit, MI', 922: 'Haverhill, MA', 923: 'St. Paul, MN', 924: 'St. Paul, MN', 930: 'Ann Arbor, MI', 933: 'Pleasanton, CA', 936: 'New York, NY', 947: 'Raton, NM', 948: 'Raton, NM', 963: 'Duluth, MN'} is not of type 'string'
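
The error above says that `feature` must be a string: `alt.topo_feature` expects the name of a feature inside the TopoJSON file (for `us_10m`, names such as `'states'` or `'counties'`), not a pandas Series of city names. Since that file has no city-level feature, one possible route is to aggregate laureates by state first. The sketch below only shows that aggregation step, on a handful of `Birth City` values copied from the error message; `state_code` is a hypothetical helper, and joining the counts to a map would still need a state-to-id lookup that is not covered here.

```python
import re
from collections import Counter

# A few 'Birth City' values copied from the error message above.
birth_cities = [
    "New York, NY",
    "Clinton, NY",
    "Germantown, PA",
    "Jamaica Plain, MA (Boston)",
    "St. Louis, MO",
]

# Two capital letters after the last comma, optionally followed by a note
# in parentheses such as "(Boston)".
STATE_PATTERN = re.compile(r",\s*([A-Z]{2})(?:\s*\(.*\))?\s*$")


def state_code(city):
    """Return the two-letter state code from a 'City, ST' string, or None."""
    match = STATE_PATTERN.search(city)
    return match.group(1) if match else None


state_counts = Counter(state_code(city) for city in birth_cities)
print(state_counts)
```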
        
In [140]:
import altair as alt
# imported under an alias so the Nobel `data` DataFrame is not overwritten
from vega_datasets import data as vega_data

# topo_feature expects the name of a feature in the TopoJSON file (a string
# such as 'counties'), not a pandas Series
counties = alt.topo_feature(vega_data.us_10m.url, 'counties')
source = vega_data.unemployment.url

alt.Chart(counties).mark_geoshape().encode(
    color='rate:Q'
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(source, 'id', ['rate'])
).project(
    type='albersUsa'
).properties(
    width=500,
    height=300
)
In [ ]: